Question 3

Author: Michal Kubina

1. Description

Introduction

First, I download 50 videos per subset; the more videos we have, the broader the analysis can be. I compare the subsets by comparing the distributions of their color features, because even a single subset can be very heterogeneous. For example, cinema technology advanced a great deal over the last 20 years, so movies from the start and the end of that period look different. Moreover, color movies only started to become common in the 1940s and 1950s, so the first subset should contain mainly black-and-white movies.

Tackling dimensionality

Given the high dimensionality of video data, I start with scene detection. I set the detection threshold to 10, a low value that splits the videos into more scenes and therefore yields more data points (frames). From each scene I take the middle frame, so that each subset is represented by its own set of frames, and then run the analysis on these frames. Because I compare distributions across subsets, the subsets do not need to contain the same number of frames. Unequal frame counts are in fact likely: video length and filming style changed over time, so the number of scenes per video differs too.
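
The middle-frame selection can be sketched in isolation. This is a minimal illustration, not the notebook's own code: `middle_frame_indices` and the `(start_frame, end_frame)` pairs are hypothetical stand-ins for the timecodes that the scene detector returns, with the end frame assumed to be exclusive.

```python
def middle_frame_indices(scenes):
    """Return the index of the middle frame for each scene.

    `scenes` is a hypothetical list of (start_frame, end_frame) pairs,
    where `end_frame` is assumed to be exclusive.
    """
    # Integer division keeps the result a valid frame index.
    return [start + (end - start) // 2 for start, end in scenes]


# Three made-up scenes of different lengths.
example_scenes = [(0, 48), (48, 51), (51, 200)]
print(middle_frame_indices(example_scenes))
```

Each scene contributes exactly one frame regardless of its length, so a subset with many short scenes ends up with more frames than one with a few long scenes.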

Explanation for choice of features:

I think the best way to analyze subsets from different periods is to look at the distributions of their color features. I believe that the image quality of trailers and movies changed over time, and I expect recent trailers to have better image quality and therefore higher saturation and a wider range of hues. I will visualize the data and compare the subsets. We also have to take into account that the first subset contains black-and-white movies, since color film only started to spread around 1940. Out of curiosity, I will also look at the distribution of the lightness of the frames and of the per-frame median of each RGB channel. The last feature is the aspect ratio of the video. It would be interesting to compare how width and height evolved across the three periods; for example, television screens today are wider than they used to be.
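
To make the features concrete, here is a small standard-library sketch that computes them for a single frame. It is an illustration under assumptions: `frame_features` and the nested-list `pixels` layout are hypothetical, and `colorsys` returns hue, saturation and value in the range 0 to 1, which differs from the scales OpenCV uses in the analysis below.

```python
import colorsys
from statistics import median


def frame_features(pixels):
    """Compute illustrative colour features for one frame.

    `pixels` is a hypothetical list of rows, each row a list of
    (r, g, b) tuples with channel values in 0..255.
    """
    height = len(pixels)
    width = len(pixels[0])
    flat = [px for row in pixels for px in row]
    # Per-channel medians of the RGB values.
    median_rgb = tuple(median(px[c] for px in flat) for c in range(3))
    # colorsys expects floats in 0..1 and returns h, s, v in 0..1.
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in flat]
    median_hsv = tuple(median(p[c] for p in hsv) for c in range(3))
    return {
        "aspect_ratio": width / height,
        "median_rgb": median_rgb,
        "median_hsv": median_hsv,
    }


# A tiny 2 x 4 frame filled with pure red.
red_frame = [[(255, 0, 0)] * 4 for _ in range(2)]
print(frame_features(red_frame))
```

Using the median rather than the mean keeps a few extreme pixels, such as bright titles on a dark background, from dominating the per-frame value.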

2. Analysis

Initialize libraries:

In [1]:
import pandas as pd
import os
import wget
from tqdm.auto import tqdm
import cv2 
import numpy as np
from scenedetect import VideoManager
from scenedetect import SceneManager
from scenedetect.detectors import ContentDetector
import colorgram
from matplotlib.colors import to_hex
from tensorflow.keras.preprocessing import image
from PIL import Image
import plotly.express as px
from matplotlib import pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()

Make subsets:

In [2]:
movies = pd.read_csv("trailers.csv") #load csv
sub2040 = movies.loc[(movies.year >= 1920) & (movies.year <= 1940), ] #make subset 1920 - 1940
sub2040 = sub2040.sample(50, random_state = 42) #sample 50 trailers
sub6080 = movies.loc[(movies.year >= 1960) & (movies.year <= 1980), ] #make subset 1960 - 1980
sub6080 = sub6080.sample(50, random_state = 42) #sample 50 trailers
sub0020 = movies.loc[(movies.year >= 2000) & (movies.year <= 2020), ] #make subset 2000 - 2020
sub0020 = sub0020.sample(50, random_state = 42) #sample 50 trailers

Define useful functions:

In [3]:
def download_videos(subset, output_folder, subset_folder):
    """
    Downloads videos based on data in dataframe

    Parameters
    ----------
    subset : pandas dataframe
        dataframe with info about videos to be downloaded
    output_folder : string
        string with the name of the main subfolder
    subset_folder : string
        string with the name of the subset folder
        
    Returns
    ---------- 
    video_paths : list
        list of videopaths
    """
    if not os.path.exists(output_folder): #if the output folder does not exist
        os.mkdir(output_folder) #create folder
    if not os.path.exists(os.path.join(output_folder, subset_folder)): #if the subset folder inside the output folder does not exist
        os.mkdir(os.path.join(output_folder, subset_folder)) #create folder
    
    video_paths = [] #initialize list
    
    for video in tqdm(subset.itertuples(), total=len(subset)): #loop through videos to be downloaded
        video_url = video.url #get video url
        output_path = os.path.join(output_folder, subset_folder, video.trailer_title + '.mp4') #get output path
        if not os.path.exists(output_path): #if path does not exist
            wget.download(video_url, out=output_path) #download the video to the output path
        video_paths.append(output_path) #append video paths
        
    return video_paths 

def find_scenes(video_path, threshold):
    """
    Finds scenes in video

    Parameters
    ----------
    video_path : string
        path of the video
    threshold : int 
        threshold for scenes
        
    Returns
    ---------- 
    scenes
    """
    video_manager = VideoManager([video_path]) #initialize video manager
    scene_manager = SceneManager() #initialize scene manager 
    scene_manager.add_detector(
        ContentDetector(threshold=threshold)) #add detector
    base_timecode = video_manager.get_base_timecode() #get base timecode of the video
    try:
        video_manager.set_downscale_factor() #downscale frames automatically to speed up detection
        video_manager.start() #start reading the video
        scene_manager.detect_scenes(frame_source=video_manager, show_progress=False) #run scene detection
        return scene_manager.get_scene_list(base_timecode) #list of (start, end) timecodes
    finally:
        video_manager.release() #free the video resources

def get_frames_for_set(paths):
    """
    Gets the middle frame of every scene for a set of videopaths

    Parameters
    ----------
    paths : list
        lists of videopaths
        
    Returns
    ---------- 
    list of frames
    """
    frames = [] #initialize list
    
    for filename in paths: #loop through all paths
        scene_list = find_scenes(filename, threshold=10) #get scene list
        cap = cv2.VideoCapture(filename) #initialize video capture

        for start_time, end_time in scene_list: #loop through all scenes
            duration = end_time - start_time #duration of the scene
            middle = start_time.get_frames() + duration.get_frames() // 2 #index of the middle frame of the scene
            cap.set(cv2.CAP_PROP_POS_FRAMES, middle) #jump to the middle frame
            ret, frame = cap.read() #read the frame
            if not ret: #skip scenes whose frame could not be read
                continue
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) #convert from BGR (OpenCV default) to RGB
            frames.append(frame) #add frame to the list
        cap.release() #release the video capture
    return frames

def load_image_from_path(image_path, target_size=None, color_mode='rgb'):
    """
    Loads image from path

    Parameters
    ----------
    image_path : string
        path of the image
    target_size : tuple
        target size of the image
    color_mode : string
        rgb or grayscale
        
    Returns
    ---------- 
    loaded image as a numpy array
    """
    pil_image = image.load_img(image_path, 
                               target_size=target_size,
                            color_mode=color_mode) #load image
    return image.img_to_array(pil_image) #return image

def save_frames(frames, sub):
    """
    Saves frames as jpgs into the specific directory

    Parameters
    ----------
    frames : list
        list of frames to be saved
    sub : string
        string which specifies directory
        
    """
    s = 'scenes' + sub + '/' #get folder path
    
    if not os.path.exists(s): #if folder does not exist, create
        os.mkdir(s)
    
    for i, frame in enumerate(frames): #loop through frames
        cv2.imwrite(s + 'frame_{}.jpg'.format(i), cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)) #frames are RGB but imwrite expects BGR, so convert back before saving
        
def get_subset_info(key):
    """
    Gets all the info for each subset 

    Parameters
    ----------
    key : string
        string which specifies directory
        
    Returns
    ---------- 
    colours : list of lists
        list of lists with median values of rgb per image
    ratios : list
        list of ratio values per image
    saturation : list
        list of saturation values per image
    hue : list
        list of hue values per image
    lightness : list
        list lightness values per image
        
    """
    directory = os.fsencode(key) #encode the directory path
    colours = [] #initialize list for rgb median values
    ratios = [] #initialize list for ratios
    saturation = [] #initialize list for saturation
    hue = [] #initialize list for hue
    lightness = [] #initialize list for lightness
    for file in tqdm(os.listdir(directory)): #loop through all the files
        filename = os.fsdecode(file) #get filename
        
        if filename.startswith('.'): #skip files which starts with dot (mainly ipynb checkpoints)
            continue

        color_image = load_image_from_path(key + filename,
                            color_mode='rgb') #load the image
        median_r, median_g, median_b = np.median(color_image, axis=(0,1)) #get median values of the image
        colours.append([median_r, median_g, median_b]) #append to the list
        
        width = color_image.shape[1] #get width of the image
        height = color_image.shape[0] #get height of the image

        aspect_ratio =  width / height #compute aspect ratio
        ratios.append(aspect_ratio) #append to the list
        
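        color_image_hsv = cv2.cvtColor(color_image, cv2.COLOR_RGB2HSV) #convert to HSV; the input is float32 in 0-255, so hue is in 0-360, saturation in 0-1 and value in 0-255
        median_h, median_s, median_v = np.median(color_image_hsv, axis=(0,1)) #get median of hue, saturation and value (used here as lightness)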
        hue.append(median_h) #append hue value to the list
        lightness.append(median_v) #append lightness value to the list
        saturation.append(median_s) #append saturation value to the list
    return colours, ratios, saturation, hue, lightness 

def visualize_boxplots(data, main, x, y):
    """
    Visualize boxplots of data

    Parameters
    ----------
    data : dictionary
        dictionary filled with data
    main : string
        name of the graph
    x : string
        name of the x axis
    y : string
        name of the y axis
        
    """
    df = pd.DataFrame(dict(stamp = np.concatenate((["1920-1940"]*len(data["1"]), ["1960-1980"]*len(data["2"]), ["2000-2020"]*len(data["3"]))),
                           vals = np.concatenate((data["1"],data["2"], data["3"])))) #create dataframe
    fig = px.box(df, x="stamp", y="vals") #create boxplot
    fig.update_layout(title = main, #update layout
                      xaxis_title = x,
                      yaxis_title = y)
    return fig

def convert_array_of_arrays(array, index):
    """
    Get list of specific color from rgb array of arrays

    Parameters
    ----------
    array : list
        contains rgb values, list of list
    index : int
        0 - r value, 1 - g value, 2 - b value
    """
    result = [] #initialize list
    for i in range(0, len(array)): #loop through all rgb values
        result.append(array[i][index]) #get specific color part
    return result

def visualize_rgb(data, main, x, y, index):
    """
    Visualize boxplots of one RGB channel of the data

    Parameters
    ----------
    data : dictionary
        dictionary filled with data
    main : string
        name of the graph
    x : string
        name of the x axis
    y : string
        name of the y axis
    index : int
        specifies the colour channel: 0 - r value, 1 - g value, 2 - b value
    """
    df = pd.DataFrame(dict(subset = np.concatenate((["1920-1940"]*len(data["1"]), ["1960-1980"]*len(data["2"]), ["2000-2020"]*len(data["3"]))),
                           col = np.concatenate((convert_array_of_arrays(data["1"], index),convert_array_of_arrays(data["2"], index),  convert_array_of_arrays(data["3"], index)))))
    #construct sufficient dataframe for plotting 

    fig = px.box(df, x="subset", y="col") #create box plot
    fig.update_layout(title = main, #update fig layout
                      xaxis_title = x,
                      yaxis_title = y)
    fig.show() #show fig
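
One practical caveat with `download_videos`: it builds the output path directly from `trailer_title`, and a title containing a slash or colon could produce an invalid or unintended path. A minimal sketch of a sanitizing helper follows; `safe_filename` is a hypothetical addition, not part of the original notebook, and the character whitelist is an assumption.

```python
import re


def safe_filename(title, extension=".mp4"):
    """Turn a free-text title into a file name without path separators."""
    # Replace every run of characters that is not a word character,
    # dot or dash with a single underscore, then trim the edges.
    cleaned = re.sub(r"[^\w.-]+", "_", title).strip("_.")
    # Fall back to a placeholder if nothing usable is left.
    return (cleaned or "untitled") + extension


print(safe_filename("Fast/Furious: Part 2"))
```

Note that two different titles can map to the same file name after cleaning, so a real pipeline might also append a unique identifier.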

Download videos:

In [4]:
videos = {} 
videos[1] = download_videos(sub2040, 
                              output_folder='videos',  # video will be in the folder 'videos'
                              subset_folder='1920_1940') # and in the subfolder '1920_1940'
videos[2] = download_videos(sub6080, 
                              output_folder='videos',  # video will be in the folder 'videos'
                              subset_folder='1960_1980') # and in the subfolder '1960_1980'
videos[3] = download_videos(sub0020, 
                              output_folder='videos',  # video will be in the folder 'videos'
                              subset_folder='2000_2020') # and in the subfolder '2000_2020'

Prepare frames from video:

In [5]:
frames = {} #initialize dictionary
frames["1"] = get_frames_for_set(videos[1]) #get frames for first subset
frames["2"] = get_frames_for_set(videos[2]) #get frames for second subset
frames["3"] = get_frames_for_set(videos[3]) #get frames for third subset

Save frames and path of directories:

In [6]:
path = {} #initialize dictionary
for key in frames: #loop through dictionary
    save_frames(frames[key], key) #save all the frames
    path[key] = 'scenes' + key + '/' #store directory of the frames

Get all the info about subset:

In [7]:
medians = {} #initialize dictionary of medians
saturation = {} #initialize dictionary of saturation
hue = {} #initialize dictionary of hue
lightness = {} #initialize dictionary of lightness 
ratios = {} #initialize dictionary of ratios
for key in path: #loop through the paths of the subset
    medians[key], ratios[key], saturation[key], hue[key], lightness[key] = get_subset_info(path[key]) #get info of the subset
Progress output (files per subset directory): 3625, 6636, 6196

Visualize ratios:

In [8]:
ratios_graph = visualize_boxplots(ratios, "Graph of ratios distribution", "subset", "ratios")
ratios_graph.show()

Visualize saturation:

In [9]:
sat_graph = visualize_boxplots(saturation, "Graph of saturation distribution", "subset", "saturations")
sat_graph.show()

Visualize hue:

In [10]:
visualize_boxplots(hue, "Graph of hue distribution", "subset", "hue").show()

Visualize lightness:

In [11]:
visualize_boxplots(lightness, "Graph of lightness distribution", "subset", "lightness").show()

Visualize medians of red color:

In [13]:
visualize_rgb(medians, "Medians of reds", "subset", "r colour", 0)

Visualize medians of green color:

In [14]:
visualize_rgb(medians, "Medians of greens", "subset", "g colour", 1)

Visualize medians of blue color:

In [15]:
visualize_rgb(medians, "Medians of blues", "subset", "b colour", 2)

3. Interpretation and conclusion

Looking at the distribution of aspect ratios, we can see that in 1920-1940 the ratio was mainly around 1.3. In the second subset, some frames are even square (ratio 1.0). Over the following decades this changed, and in 2000-2020 the frame is always wide. The saturation plot also shows interesting results. Saturation is very low for the oldest subset, which is probably a consequence of the black-and-white footage. At the same time, the higher saturation of the later subsets could suggest that technology evolved and that the pictures contain richer, more saturated colors.
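
To read ratio values such as 1.3 or 1.0 more easily, they can be mapped to the nearest familiar format. This is an illustrative sketch only: `COMMON_RATIOS` is a hand-picked, assumed reference table and `nearest_format` is a hypothetical helper, neither of which was used to produce the plot.

```python
# Assumed reference table of width / height ratios, for illustration only.
COMMON_RATIOS = {
    "1:1": 1.0,
    "4:3": 4 / 3,
    "16:9": 16 / 9,
    "2.39:1": 2.39,
}


def nearest_format(ratio):
    """Return the name of the reference format closest to `ratio`."""
    return min(COMMON_RATIOS, key=lambda name: abs(COMMON_RATIOS[name] - ratio))


for value in (1.3, 1.0, 1.78, 2.35):
    print(value, "->", nearest_format(value))
```

Under this table, a ratio of about 1.3 is closest to 4:3.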

Surprisingly, the distributions of the median red, green, and blue values do not differ much between the subsets. In the hue plot, the median for the oldest subset is at 0, which supports my hypothesis that the first subset contains mainly black-and-white pictures.
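
The link between a hue of 0 and black-and-white footage follows from how HSV is defined: a gray pixel has equal red, green and blue values, so its saturation is 0 and its hue is reported as 0 by convention. The sketch below shows this with the standard library and adds a hypothetical `grayscale_fraction` helper; the `tolerance` parameter is an assumption, meant to absorb small channel differences that image compression may introduce.

```python
import colorsys


def grayscale_fraction(pixels, tolerance=0):
    """Fraction of (r, g, b) pixels whose channels differ by at most `tolerance`."""
    if not pixels:
        return 0.0
    gray = sum(1 for r, g, b in pixels if max(r, g, b) - min(r, g, b) <= tolerance)
    return gray / len(pixels)


# A mid-gray pixel: hue and saturation are both 0.
print(colorsys.rgb_to_hsv(0.5, 0.5, 0.5))

# Two exactly gray pixels, one red pixel and one nearly gray pixel.
sample = [(10, 10, 10), (200, 200, 200), (255, 0, 0), (12, 10, 11)]
print(grayscale_fraction(sample))
print(grayscale_fraction(sample, tolerance=2))
```

A median hue of 0 alone is not conclusive, because pure red also has hue 0; combining it with the very low saturation of the oldest subset makes the black-and-white explanation more plausible.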

To conclude, I believe that color-wise the subset from 1920 to 1940 differs greatly from the other two subsets. The other two subsets, 1960-1980 and 2000-2020, can be considered similar in terms of color, as the distributions of their features are nearly the same. Watching the trailers themselves, however, one would probably still see a difference. In the future, a different method could therefore be tried to check whether these two subsets can be distinguished.
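
One possible way to put a number on "nearly the same" is the two-sample Kolmogorov-Smirnov statistic, the largest vertical gap between two empirical distribution functions. The sketch below is a plain-Python illustration and was not part of this analysis; `ks_statistic` is a hypothetical helper, and a library routine such as `scipy.stats.ks_2samp` would also report a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both CDFs are step functions, so the gap only changes at observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up values standing in for a per-frame feature of two subsets.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Applied to, say, the saturation lists of the 1960-1980 and 2000-2020 subsets, a value close to 0 would support the visual impression from the boxplots, while a larger value would point to a difference that the boxplots hide.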